{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CTW dataset tutorial (Part 3: detection baseline)\n",
    "\n",
    "In this part of the turotial, we will show you:\n",
    "\n",
    "  - [Framework of detection baseline](#Framework-of-detection-baseline)\n",
    "  - [Training steps](#Training-steps)\n",
    "  - [Predicting steps](#Predicting-steps)\n",
    "  - [Submission format](#Submission-format)\n",
    "  - [Evaluation API](#Evaluation-API)\n",
    "  - [Evaluate results locally](#Evaluate-results-locally)\n",
    "  - [Appendix: Details to the evaluation API](#Appendix:-Details-to-the-evaluation-API)\n",
    "\n",
    "Notes:\n",
    "\n",
    "  > This notebook MUST be run under `$CTW_ROOT/tutorial`.\n",
    "  >\n",
    "  > All the code SHOULD be run with `Python>=3.4`. We make it compatible with `Python>=2.7` with best effort."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Framework of detection baseline\n",
    "\n",
    "We use [YOLOv2](https://pjreddie.com/darknet/yolo/) and slightly modified it, detailed in git commits. We perform a multiscale testing scheme, which is described in our paper.\n",
    "\n",
    "Follow the classification task, we also limit the number of categories to 1001, i.e., the top 1000 frequent observed character categories and an 'others' category.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training steps\n",
    "\n",
    "Something are similar to classification tutorial, so this tutorial is a little simplified.\n",
    "\n",
    "To train SSD512 model, just substitute `../detection/` with `../ssd/`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compile darknet and download pre-trained model\n",
    "\n",
    "Firstly, initialize git submodules. If you have problem initializing submodules, you may manually download darknet and copy it to corresponding directory. Note that we have slightly modified darknet, so you should clone the repository described in `$CTW_ROOT/.gitmodules` and notice its `branch`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "!git submodule update --init --recursive\n",
    "!cd ../detection/darknet && make -j8\n",
    "!cd ../detection && if [ ! -f \"products/darknet19_448.conv.23\" ]; then curl https://pjreddie.com/media/files/darknet19_448.conv.23 -o products/darknet19_448.conv.23; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Decide categories\n",
    "\n",
    "Decide which categories are the top 1000 frequent observed character categories, and save to `products/cates.json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd ../detection && python3 decide_cates.py"
   ]
  },
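  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The selection described above can be sketched in a few lines. This is a minimal illustration and not the code of `decide_cates.py`: the function names, the flat list of characters used as input, and the choice of `len(cates)` as the label of the 'others' category are all assumptions made for the example.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def decide_categories(chars, top_k=1000):\n",
    "    # Count how often each character category occurs.\n",
    "    counts = Counter(chars)\n",
    "    # Keep the top_k most frequent categories, most frequent first.\n",
    "    return [ch for ch, _ in counts.most_common(top_k)]\n",
    "\n",
    "\n",
    "def to_label(ch, cates):\n",
    "    # Hypothetical convention: characters outside the kept categories\n",
    "    # share one extra 'others' label placed after the kept ones.\n",
    "    return cates.index(ch) if ch in cates else len(cates)\n",
    "\n",
    "\n",
    "cates = decide_categories(list('aabbbc'), top_k=2)\n",
    "print(cates, to_label('c', cates))\n",
    "```\n",
    "\n",
    "With `top_k=1000`, this yields the 1001 labels used by the baseline: 1000 kept categories plus 'others'."
   ]
  },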
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Crop images and write meta data\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - detection/products/trainval/\\*.\\{jpg,txt\\}\n",
    "  - detection/products/trainval.txt\n",
    "  - detection/products/yolo-chinese.cfg\n",
    "  - detection/products/chinese.data\n",
    "  - detection/products/chinese.names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd ../detection && python3 prepare_train_data.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run train scripts\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - detection/products/backup/\\*.weights\n",
    "\n",
    "This script outputs a mount of logs and takes a long time, so we recommand you to run it with `/bin/bash` instead of running it directly in jupyter notebook.\n",
    "\n",
    "- Time cost estimation for YOLOv2 (NVIDIA GTX TITAN X): 3.0 sec / step, 38 hours in total.\n",
    "- Time cost estimation for SSD512 (NVIDIA GTX TITAN X * 2): 1.8 sec / step, 60 hours in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!cd ../detection && python3 train.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download trained models\n",
    "\n",
    "Since training takes a lot of energy and we hate global warming, we provide trained models which are trained using TRAIN+VAL.\n",
    "\n",
    "Visit our homepage (https://ctwdataset.github.io/) and gain access to the trained models.\n",
    "\n",
    "Notes: if you are using trained models, you may run training steps with fake empty training data to produce some necessary files.\n",
    "\n",
    "1. Produce `cates.json` with TRAIN+VAL, the map from label ID to character category.\n",
    "\n",
    "  ```sh\n",
    "cp ../data/annotations/downloads/train.jsonl ../data/annotations/downloads/val.jsonl ../data/annotations/\n",
    "python3 decide_cates.py\n",
    "```\n",
    "\n",
    "1. Fake 'empty' training data.\n",
    "\n",
    "  ```sh\n",
    "echo '{\"train\":[{\"image_id\":\"0000172\",\"file_name\":\"0000172.jpg\"}],\"val\":[{\"image_id\":\"0000486\",\"file_name\":\"0000486.jpg\"}],\"test_cls\":[],\"test_det\":[{\"image_id\":\"0000238\",\"file_name\":\"0000238.jpg\"}]}' >../data/annotations/info.json\n",
    "head ../data/annotations/downloads/train.jsonl -n1 >../data/annotations/train.jsonl\n",
    "head ../data/annotations/downloads/val.jsonl -n1 >../data/annotations/val.jsonl\n",
    "```\n",
    "\n",
    "1. Run training scripts with fake data.\n",
    "\n",
    "  ```sh\n",
    "python3 prepare_train_data.py\n",
    "python3 train.py\n",
    "```\n",
    "\n",
    "  If pre-trained model is required, just touch it or download it. If it starts to train, press `CTRL+C`.\n",
    "\n",
    "1. Replace the model with downloaded trained model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predicting steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Crop testing images and write meta data\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - detection/products/test/\\*.\\{jpg,txt\\}\n",
    "  - detection/products/test.txt\n",
    "  - detection/products/yolo-chinese-test.cfg\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd ../detection && python3 prepare_test_data.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run darknet\n",
    "\n",
    "You may need to edit `TEST_NUM_GPU` (in `detection/settings.py`) and `num_thread` (in `detection/eval.py`) before running this step. Each copy of YOLOv2 takes about 3.6 GB GPU memory. Each GPU will run `num_thread / TEST_NUM_GPU` copies of YOLOv2 at the same time.\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - `detection/products/chinese.*.data`\n",
    "  - `detection/products/test.*.txt`\n",
    "  - `detection/products/results/chinese.*.txt`\n",
    "\n",
    "This script outputs a mount of logs and takes a long time, so we recommand you to run it with `/bin/bash` instead of running it directly in jupyter notebook.\n",
    "\n",
    "Time cost estimation for YOLOv2 (NVIDIA GTX TITAN X \\* 2): 0.2 sec / subimage for each thread, 2.4 hours in total if using 6 threads.\n",
    "\n",
    "Notes:\n",
    "\n",
    "  > For validation set, which size is about $0.5$ times detection testing set, time cost estimation is 1.2 hours in total."
   ]
  },
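  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough check before running this step, the GPU memory needed on each device can be estimated from the figures above. This is only back-of-the-envelope arithmetic; the function `copies_and_memory_per_gpu` is hypothetical and is not part of `detection/eval.py`.\n",
    "\n",
    "```python\n",
    "def copies_and_memory_per_gpu(num_thread, test_num_gpu, gb_per_copy=3.6):\n",
    "    # Each GPU runs num_thread / TEST_NUM_GPU copies of YOLOv2 at the same time.\n",
    "    copies = num_thread / test_num_gpu\n",
    "    # Each copy takes about 3.6 GB of GPU memory (the figure quoted above).\n",
    "    return copies, copies * gb_per_copy\n",
    "\n",
    "\n",
    "copies, memory_gb = copies_and_memory_per_gpu(num_thread=6, test_num_gpu=2)\n",
    "print('{:.0f} copies per GPU, about {:.1f} GB per GPU'.format(copies, memory_gb))\n",
    "```\n",
    "\n",
    "With 6 threads on 2 GPUs, this gives 3 copies and about 10.8 GB per GPU."
   ]
  },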
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!cd ../detection && python3 eval.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Merge results\n",
    "\n",
    "We don't apply non-maximum suppression (NMS) on each subimage in YOLOv2.\n",
    "\n",
    "1. Collect candidates from results of subimages from YOLOv2.\n",
    "2. Remove candidates in improper size.\n",
    "3. Splice candidates on subimages to full images.\n",
    "4. Apply NMS on full images.\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - detection/products/detections.jsonl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!cd ../detection && python3 merge_results.py"
   ]
  },
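  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NMS step of the merge can be illustrated with a generic greedy NMS over `[x, y, w, h]` boxes. This is a minimal sketch of the standard technique and not the code of `merge_results.py`: the function names, the class-agnostic suppression, and the `iou_thresh` value of 0.5 are assumptions chosen for the example.\n",
    "\n",
    "```python\n",
    "def iou(a, b):\n",
    "    # Boxes are [x, y, w, h]; returns intersection area over union area.\n",
    "    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])\n",
    "    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])\n",
    "    if iw <= 0 or ih <= 0:\n",
    "        return 0.0\n",
    "    inter = iw * ih\n",
    "    return inter / (a[2] * a[3] + b[2] * b[3] - inter)\n",
    "\n",
    "\n",
    "def nms(detections, iou_thresh=0.5):\n",
    "    # detections: list of dicts with 'bbox' and 'score' keys.\n",
    "    kept = []\n",
    "    # Visit candidates from the most to the least confident.\n",
    "    for det in sorted(detections, key=lambda d: -d['score']):\n",
    "        # Keep a candidate only if it overlaps no kept box too much.\n",
    "        if all(iou(det['bbox'], k['bbox']) <= iou_thresh for k in kept):\n",
    "            kept.append(det)\n",
    "    return kept\n",
    "\n",
    "\n",
    "dets = [\n",
    "    {'bbox': [0, 0, 10, 10], 'score': 0.9},\n",
    "    {'bbox': [1, 1, 10, 10], 'score': 0.8},\n",
    "    {'bbox': [50, 50, 10, 10], 'score': 0.7},\n",
    "]\n",
    "print([d['score'] for d in nms(dets)])\n",
    "```\n",
    "\n",
    "In this toy input, the second box overlaps the first with an IOU of about 0.68 and is suppressed, while the distant third box is kept."
   ]
  },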
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission format\n",
    "\n",
    "Detection submission MUST be UTF-8 encoded [JSON Lines](http://jsonlines.org/), each line MUST match the corresponding item in `test_det` in overall information (`$CTW_ROOT/data/annotations/info.json`), which is described in `Tutorial part-1: Basics`.\n",
    "\n",
    "```\n",
    "result (corresponding to one line in .jsonl):\n",
    "{\n",
    "    detections: [detection_0, detection_1, detection_2, ...],  # length of this list MUST <=1000\n",
    "}\n",
    "\n",
    "detection:\n",
    "{\n",
    "    bbox: [x, y, w, h],          # x, y, w, h are floating-point numbers, where w, h MUST be greater than 0\n",
    "    text: str,                   # length is usually 1, otherwise this detection must be a false positive\n",
    "    score: float,\n",
    "}\n",
    "```"
   ]
  },
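  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One line of a submission file could be produced and checked as in the sketch below. The helper names `check_result` and `dump_result` are hypothetical, and the checks mirror only the constraints stated above (at most 1000 detections, `w` and `h` greater than 0); the evaluation API may validate more than this.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "\n",
    "def check_result(result):\n",
    "    # The list of detections MUST contain at most 1000 items.\n",
    "    assert len(result['detections']) <= 1000\n",
    "    for det in result['detections']:\n",
    "        x, y, w, h = det['bbox']\n",
    "        # w and h MUST be greater than 0.\n",
    "        assert w > 0 and h > 0\n",
    "        assert isinstance(det['text'], str)\n",
    "        float(det['score'])  # raises if the score is not a number\n",
    "\n",
    "\n",
    "def dump_result(result):\n",
    "    check_result(result)\n",
    "    # ensure_ascii=False keeps Chinese characters readable in the output.\n",
    "    return json.dumps(result, ensure_ascii=False)\n",
    "\n",
    "\n",
    "line = dump_result({'detections': [\n",
    "    {'bbox': [10.0, 20.0, 32.5, 30.0], 'text': '中', 'score': 0.87},\n",
    "]})\n",
    "print(line)\n",
    "```\n",
    "\n",
    "A full submission would contain one such line per item in `test_det`, written to a file opened with `encoding='utf-8'`."
   ]
  },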
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation API\n",
    "\n",
    "The calculation of `AP` is similar to PASCAL VOC. More evaluation metrics are implemented in `cppapi/eval_tools.hpp`. Here is a brief description, more details can be found in [Appendix](#Appendix:-Details-to-the-evaluation-API).\n",
    "\n",
    "  1. At most 1000 detections are allowed for each image.\n",
    "  1. IOU threshold is $0.5$.\n",
    "  1. A detection which does not match\n",
    "a ground truth but matches a subregion of an 'ignore' region is excluded during the evaluation.\n",
    "  1. `macro-mAP` (also called `mAP`) is mean over character categories, weighted by number of character instances in corresponding category.\n",
    "  1. `micro-mAP` is mean over images, where each image has the equivalent weight.\n",
    "  1. `recall` is computed as described in the paper.\n",
    "  1. When computing metrics for a specified size range,\n",
    "    1. a detection (DT) out of size range is excluded during the evaluation,\n",
    "    1. a ground truth (GT) out of size range is exluded,\n",
    "    1. a DT in range matches a GT out of range is excluded, a DT out of range matches a GT in range is included.\n",
    "\n",
    "If no error occurred, the data struct for the output of evaluation API is described below.\n",
    "\n",
    "```\n",
    "output:\n",
    "{\n",
    "    error: 0,\n",
    "    performance: {\n",
    "        all: size_performance,\n",
    "        large: size_performance,\n",
    "        medium: size_performance,\n",
    "        small: size_performance,\n",
    "    },\n",
    "}\n",
    "\n",
    "size_performance:\n",
    "{\n",
    "    n: int,                                                             # number of GTs\n",
    "    AP: float,\n",
    "    AP_curve: Y_curve,\n",
    "    mAP: float,                                                         # i.e., macro-mAP\n",
    "    mAP_curve: XY_curve,                                                # only C++ API computes mAP_curve\n",
    "    attributes: [recall_0, recall_1, recall_2, ..., recall_63],         # recall for each attribute. Index is bitwise, described in 'part-2: classification'\n",
    "    texts: {str_0: recall_0, str_1: recall_1, str_2: recall_2, ...},    # recall for each character category\n",
    "    mAP_micro: float,\n",
    "}\n",
    "\n",
    "XY_curve:\n",
    "[(x_0, y_0), (x_1, y_1), (x_2, y_2), ...]\n",
    "\n",
    "Y_curve:\n",
    "[y_0, y_1, y_2, ...]      # of which X are 1/n, 2/n, 3/n, ..., respectively\n",
    "\n",
    "recall: {\n",
    "    n: int,\n",
    "    recall: int,\n",
    "}\n",
    "```\n",
    "\n",
    "The data struct for the output of evaluation server is described below.\n",
    "\n",
    "```\n",
    "evaluation server output:\n",
    "{\n",
    "    size_ranges: list,    # the configure of considered sizes on the evaluation server, defined in `codalab/settings.py`\n",
    "    attributes: list,     # the configure of considered attributes, always [\"occluded\", \"bgcomplex\", etc.]\n",
    "    max_det: int,         # the configure of limit of number of detections per image, always 1000\n",
    "    iou_thresh: float,    # the configure of IOU threshold, always 0.5\n",
    "    performance: {\n",
    "        all: size_performance,\n",
    "        large: size_performance,\n",
    "        medium: size_performance,\n",
    "        small: size_performance,\n",
    "    },\n",
    "}\n",
    "```\n",
    "\n",
    "The `size_performance` in the output of evaluation server slightly differs from the output of our evaluation API.\n",
    "\n",
    "  - `AP_curve` is discretize to `AP_curve_discrete`, which type is `XY_curve`, to avoid someone can infer which DTs are truths from the curve, and to reduce the size of output file.\n",
    "  - `mAP_curve` is discretize to `mAP_curve_discrete`, for the same reason.\n",
    "  - `texts` only contains top-10 frequent categories, to avoid revealing the frequency of each category on testing set, and to reduce the size of output file.\n",
    "\n",
    "AP of size `all` should be considered the single most important metric on CTW dataset."
   ]
  },
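  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two averages `macro-mAP` and `micro-mAP` can be written out as a short sketch. It assumes that the per-category and per-image APs have already been computed; the function names and the input layout are hypothetical and do not reflect the interface of `cppapi/eval_tools.hpp`.\n",
    "\n",
    "```python\n",
    "def macro_map(ap_by_category, count_by_category):\n",
    "    # Mean over character categories, weighted by the number of\n",
    "    # character instances in each category.\n",
    "    total = sum(count_by_category[c] for c in ap_by_category)\n",
    "    weighted = sum(ap_by_category[c] * count_by_category[c] for c in ap_by_category)\n",
    "    return weighted / total\n",
    "\n",
    "\n",
    "def micro_map(ap_by_image):\n",
    "    # Mean over images, where every image has the same weight.\n",
    "    return sum(ap_by_image) / len(ap_by_image)\n",
    "\n",
    "\n",
    "print(macro_map({'a': 0.5, 'b': 1.0}, {'a': 3, 'b': 1}))\n",
    "print(micro_map([0.25, 0.75]))\n",
    "```\n",
    "\n",
    "In the toy input, category 'a' has three instances and therefore pulls the weighted mean towards its AP of 0.5."
   ]
  },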
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate results locally\n",
    "\n",
    "Since the following steps rely on `judge/products/stat_frequency.json`, you SHOULD firstly gather statistics, which is described in `Gather statistics` section in `Tutorial part-2: classification`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluate detection performance\n",
    "\n",
    "We use [rapidjson](https://github.com/Tencent/rapidjson) library in C++ API, please initialize submodules or download this library to `cppapi/rapidjson` manually.\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - `<stdout>`: detection performance for each size and each attribute\n",
    "  - `<stdout>`: detection performance of top-10 most frequent character categories for each size\n",
    "  - `judge/products/plots/det_AP_curve.pdf`: (described in our paper)\n",
    "  - `judge/products/plots/det_mAP_curve.pdf`: macro-mAP curve is mean over AP curves for each category\n",
    "  - `judge/products/plots/det_recall_by_attr_size.pdf`: recall for each size and each attribute, respectively\n",
    "  - `judge/products/detection_report.json`: the output of evaluation API described above\n",
    "  - `judge/products/explore_det.html`: performance for each conbination of attributes and each size\n",
    "\n",
    "Notes:\n",
    "\n",
    "  > If you are using trained models and validation set, you may result in a higher performance than paper. The reason may be our models are trained on TRAIN+VAL, while you are using validation set as testing set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!git submodule update --init --recursive\n",
    "!cd ../judge && python3 detection_perf.py"
   ]
  },
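  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once `judge/products/detection_report.json` exists, the headline numbers can be pulled out with a few lines of Python. The sketch below assumes that the file follows the `output` structure described in the Evaluation API section and that a non-zero `error` signals a failure; the helper `summarize` is hypothetical, and the inline `report` dict is made-up sample data standing in for a real report.\n",
    "\n",
    "```python\n",
    "def summarize(report):\n",
    "    # Assumption: a non-zero error code means the evaluation did not succeed.\n",
    "    if report.get('error', 0) != 0:\n",
    "        raise ValueError('evaluation failed')\n",
    "    rows = []\n",
    "    for size in ('all', 'large', 'medium', 'small'):\n",
    "        perf = report['performance'][size]\n",
    "        rows.append((size, perf['n'], perf['AP'], perf['mAP'], perf['mAP_micro']))\n",
    "    return rows\n",
    "\n",
    "\n",
    "# Made-up numbers for illustration only; a real report would be loaded\n",
    "# from judge/products/detection_report.json with the json module.\n",
    "fake = {'n': 10, 'AP': 0.5, 'mAP': 0.4, 'mAP_micro': 0.6}\n",
    "sizes = ('all', 'large', 'medium', 'small')\n",
    "report = {'error': 0, 'performance': {s: dict(fake) for s in sizes}}\n",
    "for row in summarize(report):\n",
    "    print('{:<6} n={} AP={:.3f} mAP={:.3f} micro-mAP={:.3f}'.format(*row))\n",
    "```"
   ]
  },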
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Draw detections on images\n",
    "\n",
    "We show TPs in <font color=\"#0f0\">**green**</font>, and show FPs in <font color=\"#ff0\">**yellow**</font>. For each image, we draw most confident DTs, and set $num(TPs) + num(FPs) = num(GTs)$.\n",
    "\n",
    "We will output the following files in this step:\n",
    "\n",
    "  - `judge/products/printtext-drawing/*.pdf`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd ../judge && python3 draw_detection_text.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix: Details to the evaluation API\n",
    "\n",
    "Our Python evaluation API in `pythonapi/eval_tools.py` and C++ evaluation API `cppapi/eval_tools.hpp` work as follows, which is similar to PASCAL VOC.\n",
    "\n",
    "  1. Check the detection (DT) file has the same number of lines as the ground truth (GT) file. Otherwise, return error.\n",
    "  1. Check each line of the DT file is valid JSON, and conform to the submission format. Otherwise, return error.\n",
    "  1. Use the ignore list (IG) in the annotations.\n",
    "  1. Remove non-Chinese character instances from GTs.\n",
    "  1. For each size, for any given confidence score $c_0$, we deal with DTs, GTs, and IGs of each image in the following steps, respectively.\n",
    "    1. Move GTs which are not fit current size range to IG. (For size 'all', this step always has no effect.)\n",
    "    1. DTs which confidence score is less than or equal to $c_0$ are removed.\n",
    "    1. Match DTs with GTs greedily. We consider DTs in score order. For each DT, GTs are ordered by $IOU(DT, GT)$ in descending order.  matched DTs are true positives (TPs).\n",
    "    1. Remove unmatched DTs which can match IGs. They will have no effect on the evaluation.\n",
    "    1. Remove unmatched DTs which are not fit current size range. (For size 'all', this step always has no effect.)\n",
    "    1. Remaining DTs are FPs, Remaining GTs are false negatives (FNs).\n",
    "  1. For each size, we compute metrics in following steps, respectively.\n",
    "    1. Take all TPs, FNs, and FPs to compute `AP`.\n",
    "    1. For each character category, take TPs, FNs, and FPs in the specified category to compute average precision (AP). Compute mean of these APs weighted by the number of character instances in the corresponding category, denote the mean value as `macro-mAP`, also called `mAP`.\n",
    "    1. For each image, we take a minimum confidence score $c_0$ which leads to $num(TPs) + num(FPs) \\leq num(GTs)$, respectively. Then,\n",
    "      1. for each attribute, compute `recall` by counting TPs which $score > c_0$ and belong to the specified attribute for each image, respectively.\n",
    "      1. for each character category, compute `recall` by counting TPs which $score > c_0$ and belong to the specified category for each image, respectively.\n",
    "    1. For each image, take TPs, FNs, and FPs in the specified image to compute AP. Compute the mean of these APs, denote this mean value as `micro-mAP`.\n",
    "\n",
    "When matching a DT with a GT, we require they have the identical character category, and $IOU(DT, GT) > 0.5$. When matching a DT with a IG, we require $\\exists ig \\subseteq IG$, s.t. $IOU(DT, ig) > 0.5$. Of which $IOU(A, B) = \\frac{Area(A \\cap B)}{Area(A \\cup B)}$. Otherwise, they are not matched.\n",
    "\n",
    "When computing an AP, for any confidence score $c_0$, we compute precision by taking the maximum of each precision at confidence score $c \\geq c_0$.\n",
    "\n",
    "Computation steps above are inefficient. We write more efficient code which don't follow these steps but results in the same results for any legal detection files.\n",
    "\n",
    "Notes:\n",
    "\n",
    "  > Python API and C++ API may generate slightly different results due to floating-point precision, the difference is always less than 0.0000001%. We officially approve the results of C++ API.\n",
    "  >\n",
    "  > Even if some DTs have the same confidence score, our evaluation API produces a certain reproducible result."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
